[5 points]
Do not use grid.arrange() for this exercise. Rather, use gather() to tidy the data and then facet on window number. To make the comparison, use relative frequency bar charts (the heights of the bars in each facet sum to one). Describe how the distributions differ.
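library(tidyr)
library(ggplot2)
# Reshape to long form: one row per (window, symbol) observation.
df <- gather(DAAG::vlt, key = "window", value = "symbol", window1:window3)
# group = window makes the proportions sum to one within each facet.
ggplot(df) +
  geom_bar(aes(x = symbol, y = after_stat(prop), group = window)) +
  facet_wrap(~window)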
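The claim that the bar heights in each facet sum to one can be checked numerically. The following is a minimal sketch in base R on a made-up toy data frame (the values are illustrative and are not the DAAG::vlt data); it computes the same per-window proportions that the grouped bar chart displays.

```r
# Toy stand-in for the gathered data: one row per (window, symbol) observation.
# These values are invented for illustration only.
toy <- data.frame(
  window = rep(c("window1", "window2"), each = 4),
  symbol = c(0, 0, 1, 2, 0, 1, 1, 3)
)

# Count symbols within each window, then divide each row by its row total.
counts <- table(toy$window, toy$symbol)
rel_freq <- prop.table(counts, margin = 1)

print(round(rel_freq, 2))
# Each window's proportions sum to one, like the bar heights in one facet.
print(rowSums(rel_freq))
```

Dropping the group aesthetic from the plot would instead compute each bar's proportion on its own, so every bar would have height one.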
0, appearing most often in each window.
symbol 0. It’s most likely nothing. Especially in window1.
[21 points]
This data comes from the “Detailed Mortality” database available on https://wonder.cdc.gov/
df <- read.csv("../data/Death2015.txt", sep="\t")
notes <- tail(df, 44)
df <- head(df, -44)
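The hard-coded 44 depends on the exact footer length of this particular export. A possible alternative, sketched below on a toy data frame, locates the footer from the Notes column instead. It assumes that data rows have an empty Notes field and that footer rows have text in it; that layout, the toy values, and the Deaths column are assumptions for illustration, not something verified against the actual file.

```r
# Toy stand-in for the export. Assumed layout: data rows have an empty
# Notes field, footer rows carry text in Notes. Values are made up.
toy <- data.frame(
  Notes = c("", "", "", "---", "Dataset: Example", "Query Date: Example"),
  Deaths = c(10, 20, 30, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Index of the first row whose Notes field is non-empty.
first_note <- which(nzchar(toy$Notes))[1]

# Everything from that row on is footer; everything before it is data.
notes <- toy[first_note:nrow(toy), , drop = FALSE]
data_rows <- toy[seq_len(first_note - 1), , drop = FALSE]

print(nrow(notes))
print(nrow(data_rows))
```

If the real export marks some data rows in Notes (for example, total rows), the condition would need to be adjusted accordingly.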
Code for all preprocessing must be shown. (That is, don’t open the file in Excel or similar, change things around, save it, and then import to R. Why? Because your steps are not reproducible.)
For the Place of Death, Ten-Year Age Groups, and ICD Chapter Code variables, do the following:
Identify the type of variable (nominal, ordinal, or discrete) and draw a horizontal bar chart using best practices for order of categories.
d1.title <- ggtitle("Place of Death: Nominal Data")
d1 <- ggplot(df) +
geom_bar(aes(Place.of.Death)) +
coord_flip() +
xlab("")
d2.title <- ggtitle("Ten Year Age Groups: Ordinal Data")
d2 <- ggplot(df) +
geom_bar(aes(Ten.Year.Age.Groups)) +
coord_flip() +
xlab("")
d3.title <- ggtitle("ICD Chapter Code: Nominal Data")
d3 <- ggplot(df) +
geom_bar(aes(ICD.Chapter.Code)) +
coord_flip() +
xlab("")
# grid.arrange() comes from the gridExtra package.
gridExtra::grid.arrange(d1 + d1.title, d2 + d2.title, d3 + d3.title)
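One common reading of "best practices for order of categories" is to sort nominal categories by frequency and to keep ordinal categories in their natural order. The sketch below shows both in base R by setting factor levels explicitly; the category labels are made up for illustration and are not taken from the CDC file.

```r
# Nominal: order levels by frequency. Labels are illustrative only.
place <- c("Home", "Hospital", "Home", "Hospice", "Hospital", "Home")

# sort() puts the counts in ascending order. After coord_flip() the last
# level is drawn at the top, so the most frequent category ends up on top.
by_freq <- factor(place, levels = names(sort(table(place))))
print(levels(by_freq))

# Ordinal: state the natural order explicitly instead of relying on the
# default alphabetical sort of the labels.
age <- c("5-14 years", "< 1 year", "1-4 years", "< 1 year")
age_ord <- factor(age,
                  levels = c("< 1 year", "1-4 years", "5-14 years"),
                  ordered = TRUE)
print(levels(age_ord))
```

Assigning a reordered factor back to the plotted column before calling ggplot() would carry the ordering into the bar chart.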
scales = "free" with facet_wrap(). It should look like this (with data, of course!). Describe notable features.
ggplot(df, aes(ICD.Sub.Chapter.Code)) +
geom_bar() +
facet_wrap(~ICD.Chapter.Code, ncol=3, scales = "free") +
coord_flip()
U00-U99 only has a single Sub-Chapter, itself. As a result it is very uniform. It also doesn’t have much data.
H60-H93.
Change the scales parameter to scales = "free_y". What changed? What information does this set of graphs provide that wasn’t available in part (b)?
ggplot(df, aes(ICD.Sub.Chapter.Code)) +
geom_bar() +
facet_wrap(~ICD.Chapter.Code, ncol=3, scales = "free_y") +
coord_flip()
# Relative frequency within each chapter, with scales = "free_y".
ggplot(df, aes(ICD.Sub.Chapter.Code)) +
  geom_bar(aes(y = after_stat(prop), group = ICD.Chapter.Code)) +
  facet_wrap(~ICD.Chapter.Code, ncol = 3, scales = "free_y") +
  coord_flip()
# Relative frequency within each chapter, with scales = "free".
ggplot(df, aes(ICD.Sub.Chapter.Code)) +
  geom_bar(aes(y = after_stat(prop), group = ICD.Chapter.Code)) +
  facet_wrap(~ICD.Chapter.Code, ncol = 3, scales = "free") +
  coord_flip()
free and free_y) to show relative frequency in the bar charts for each chapter.
ICD Chapter and ICD Sub-Chapter instead of the code versions.) What type of data is this? Note any interesting features.
data <- subset(df, ICD.Chapter.Code %in% c("H60-H93"))
ggplot(data, aes(ICD.Sub.Chapter)) +
geom_bar(aes(x=ICD.Sub.Chapter)) +
xlab(unique(data$ICD.Chapter)) +
ylab("Death Count") +
coord_flip()
Diseases of the ear and mastoid process we could see the distribution of deaths for each Sub Chapter.
Diseases of middle ear and mastoid. Not sure why that is. Could be because of little data.
[6 points]
Cite your sources with links.
for(i in notes$Notes){
if(grepl("Query Date", i)){
print(i)
}
}
[1] "Query Date: Feb 5, 2018 5:08:43 PM"
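The loop above can also be written as a single vectorized call. This is a small sketch on a made-up character vector standing in for notes$Notes; the entries are placeholders, not the real footer text.

```r
# Placeholder footer lines standing in for notes$Notes.
notes_text <- c("---", "Dataset: Example", "Query Date: Example timestamp")

# value = TRUE returns the matching strings rather than their positions;
# fixed = TRUE treats the pattern as a literal string, not a regex.
query_date <- grep("Query Date", notes_text, value = TRUE, fixed = TRUE)
print(query_date)
```

Unlike the loop, this returns the matches as a vector, which can be stored or reused later in the report.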
Underlying Cause of Death 1999-2016, specifically 2015, for all states in the U.S. (inside the data)
National Vital Statistics System, National Health Interview Survey, National Health and Nutrition Examination Survey, National Health Care Surveys. (Wikipedia and https://www.cdc.gov/nchs/data/factsheets/factsheet_health_statistics.htm)
[12 points]
Explore length vs. year in the ggplot2movies data set, after removing outliers. (Choose a reasonable cutoff).
Draw four scatterplots of length vs. year with the following variations:
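# Remove outliers: keep movies shorter than 200 minutes.
df <- subset(ggplot2movies::movies, length < 200)
# a: points with alpha blending. "length vs. year" puts length on the y-axis.
ggplot(df, aes(x = year, y = length)) +
  geom_point(alpha = .1)
# b: points with alpha blending plus 2D density contour lines.
ggplot(df, aes(x = year, y = length)) +
  geom_point(alpha = .1) +
  geom_density2d(bins = 10)
# c: hexagonal heatmap of bin counts.
ggplot(df, aes(x = year, y = length)) +
  geom_hex(bins = 30)
# d: square heatmap of bin counts (10 years by 10 minutes per bin).
ggplot(df, aes(x = year, y = length)) +
  geom_bin2d(binwidth = c(10, 10)) +
  coord_equal()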
For all, adjust parameters to the levels that provide the best views of the data.
b is more helpful in discerning the true dispersion of the data. We can see where the majority of the data lies, along with its density and concentration.
c is the most difficult to read. We generally have a harder time attributing color to values. As a result, I can’t tell that there’s a gap in length between 10 and 60 minutes.
a communicates that there’s a gap in the data. However, it is more difficult to see how the data is spread.
d and c feel very similar and are difficult to read. The only thing that I could tell from these graphs is that most lengths are concentrated around 100 minutes.
[6 points]
The leafshape dataset in the DAAG package includes three measurements on each leaf (length, width, petiole) and the logarithms of the three measurements.
(a) Draw sploms for the two sets of three variables. What conclusions would you draw from each set? Which do you find more useful?
library(lattice)
df <- DAAG::leafshape
# Columns 1-3: raw measurements; columns 5-7: their logs (per the titles).
splom(df[1:3], main = "Leaf length, width and petiole (mm)")
splom(df[5:7], main = "Log of Leaf length, width and petiole")
df <- DAAG::leafshape
super.sym <- trellis.par.get("superpose.symbol")
key <- list(title = "Leaf Architecture",
points = list(pch = super.sym$pch[1:2],
col = super.sym$col[1:2]),
text = list(c("Plagiotropic", "Orthotropic")))
splom(~df[1:3], groups=arch, data=df, panel=panel.superpose, key=key)
splom(~df[5:7], groups=arch, data=df, panel=panel.superpose, key=key)